For this project, I ask whether differences in people's income are associated with the number of Covid-19 deaths in the US. For the first dataset, I use the median income for each US state provided by the United States Census Bureau, available at ‘https://www.census.gov/search-results.html?q=Median+income+&page=1&stateGeo=none&searchtype=web&cssp=SERP&_charset_=UTF-8’. For the second dataset, I use the collection of Covid-19 deaths and all-cause deaths for each state and county in the US provided by the CDC, available at ‘https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-in-the-United-St/kn79-hsxy’.
I merged the two datasets on the variable ‘State’ to obtain a single dataset for further analysis. I then removed the commas that appear in some numeric values (for example, changing 14,500 to 14500) so that R reads them as numbers. Next, I renamed variables whose names contain spaces, for example changing “urban rural code” to “urban_rual_code”, so that each name is a single word. Before producing any statistical results, the most important step is to check for missing values in the data. For any observation with a missing value for the death count, I replaced it with 0. To summarize the key outcomes by state, I created new variables holding the total deaths in each state. I then created a table showing the details of each key variable. The table contains four variables, grouped by state: number of counties, income, COVID-19 deaths and all-cause deaths. For the data visualization, I plotted four graphs to show the associations between the key variables. For example, I drew a US map showing the number of COVID-19 deaths in each state and a scatter plot showing the association between income and the number of Covid-19 deaths.
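The cleaning steps described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the project's actual script: the two small data frames, their column names (“COVID Deaths”, “All Deaths”, “Urban Rural Code”) and all values are made up, and the helper `to_number` is hypothetical.

```r
# Toy stand-ins for the Census and CDC tables; every value here is invented.
income <- data.frame(
  State = c("Alpha", "Beta"),
  Income = c("45,000", "60,500"),
  stringsAsFactors = FALSE
)
covid <- data.frame(
  State = c("Alpha", "Alpha", "Beta"),
  County = c("A1", "A2", "B1"),
  `COVID Deaths` = c("1,200", NA, "300"),
  `All Deaths` = c("14,500", "2,000", "5,100"),
  `Urban Rural Code` = c(1, 3, 2),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Replace spaces in variable names with underscores.
names(covid) <- gsub(" ", "_", names(covid))

# Hypothetical helper: strip thousands separators, then convert to numeric.
to_number <- function(x) as.numeric(gsub(",", "", x))
income$Income <- to_number(income$Income)
covid$COVID_Deaths <- to_number(covid$COVID_Deaths)
covid$All_Deaths <- to_number(covid$All_Deaths)

# Replace missing death counts with 0, as described in the text.
covid$COVID_Deaths[is.na(covid$COVID_Deaths)] <- 0

# Merge the two tables on State.
merged <- merge(covid, income, by = "State")

# State-level totals plus the number of counties per state.
by_state <- aggregate(cbind(COVID_Deaths, All_Deaths) ~ State + Income,
                      data = merged, FUN = sum)
by_state$Counties <- as.vector(table(merged$State)[by_state$State])
print(by_state)
```

Note that replacing missing counts with 0 treats suppressed or unreported values as true zeros, so state totals computed this way may understate the real figures.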
Merging the two datasets
I checked the dimensions of the data and found 3023 observations with 14 variables each. I then summarized the key variables: income, Covid-19 deaths and all-cause deaths. The lowest state median income is $45,081 and the highest is $86,420. The lowest number of COVID-19 deaths is 0, in Colorado, the highest is 73920, in California, and the mean number of COVID-19 deaths per state is 20504. Turning to the data visualization, the first plot shows that California, Florida, New York and Texas have many more COVID-19 deaths than the other states. The second plot shows that the range of income across states is relatively large, at $41,339: Mississippi has the lowest median income ($45,081) and the District of Columbia has the highest ($86,420). The third graph shows the association between urban-rural classification and COVID-19 deaths. There is a fairly positive linear association: the more populous the county category, the more COVID-19 deaths occur. The last graph is a scatter plot of income against COVID-19 deaths. However, it shows no clear linear pattern; the cloud of points looks roughly like a normal distribution.
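As a quick check on the figures above, and as one possible way to quantify what the scatter plot shows, a minimal R sketch follows. The two income extremes are the values reported in the text. The small `by_state` table is hypothetical, with invented values, so the correlation it produces says nothing about the real data.

```r
# Reported extremes of state median income (taken from the text).
income_min <- 45081  # Mississippi
income_max <- 86420  # District of Columbia
income_range <- income_max - income_min
print(income_range)  # 41339

# Hypothetical state-level table; the values are invented for illustration.
by_state <- data.frame(
  Income = c(45081, 52000, 61000, 70000, 86420),
  COVID_Deaths = c(9000, 21000, 35000, 18000, 1500)
)

# Pearson correlation and a simple linear fit of deaths on income.
r <- cor(by_state$Income, by_state$COVID_Deaths)
fit <- lm(COVID_Deaths ~ Income, data = by_state)
print(r)
print(summary(fit)$r.squared)  # equals r^2 for a one-predictor model
```

A correlation near zero would be consistent with the unclear pattern in the scatter plot, although raw death counts also depend heavily on state population, so a per-capita rate might be a more informative outcome.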